Abstract


This work attempts to explain the types of 
computation that neural networks can perform 
by relating them to automata. We first define 
what it means for a real-time network with 
bounded precision to accept a language. A 
measure of network memory follows from this 
definition. We then characterize the classes of 
languages acceptable by various recurrent net- 
works, attention, and convolutional networks. 
We find that LSTMs function like counter ma- 
chines and relate convolutional networks to the 
subregular hierarchy. Overall, this work at- 
tempts to increase our understanding and abil- 
ity to interpret neural networks through the 
lens of theory. These theoretical insights help 
explain neural computation, as well as the rela- 
tionship between neural networks and natural 
language grammar. 


1 Introduction 


In recent years, neural networks have achieved 
tremendous success on a variety of natural lan- 
guage processing (NLP) tasks. Neural networks 
employ continuous distributed representations of 
linguistic data, which contrast with classical dis- 
crete methods. While neural methods work well, 
one of the downsides of the distributed representa- 
tions that they utilize is interpretability. It is hard 
to tell what kinds of computation a model is capa- 
ble of, and when a model is working, it is hard to 
tell what it is doing. 

This work aims to address such issues of inter- 
pretability by relating sequential neural networks 
to forms of computation that are better understood.
In theoretical computer science, the
computational capacities of many different kinds 
of automata formalisms are clearly established. 
Moreover, the Chomsky hierarchy links natural 
language to such automata-theoretic languages
(Chomsky, 1956). Thus, relating neural networks
to automata yields insight both into what general
forms of computation such models can perform
and into how such computation relates to natural
language grammar.

Recent work has begun to investigate what 
kinds of automata-theoretic computations various 
types of neural networks can simulate. Weiss et al. 
(2018) propose a connection between long short- 
term memory networks (LSTMs) and counter au- 
tomata. They provide a construction by which 
the LSTM can simulate a simplified variant of a 
counter automaton. They also demonstrate that 
LSTMs can learn to increment and decrement their 
cell state as counters in practice. Peng et al. 
(2018), on the other hand, describe a connec- 
tion between the gating mechanisms of several re- 
current neural network (RNN) architectures and 
weighted finite-state acceptors. 

This paper follows Weiss et al. (2018) by an- 
alyzing the expressiveness of neural network ac- 
ceptors under asymptotic conditions. We formal- 
ize asymptotic language acceptance, as well as an 
associated notion of network memory. We use 
this theory to derive computation upper bounds 
and automata-theoretic characterizations for sev- 
eral different kinds of recurrent neural networks 
(Section 3), as well as other architectural vari- 
ants like attention (Section 4) and convolutional 
networks (CNNs) (Section 5). This leads to a 
fairly complete automata-theoretic characteriza- 
tion of sequential neural networks. 

In Section 6, we report empirical results in- 
vestigating how well these asymptotic predic- 
tions describe networks with continuous activa- 
tions learned by gradient descent. In some cases, 
networks behave according to the theoretical predictions,
but we also find cases where there is a gap
between the asymptotic characterization and actual
network behavior.

Still, discretizing neural networks using an 
asymptotic analysis builds intuition about how the 
network computes. Thus, this work provides in- 
sight about the types of computations that sequen- 
tial neural networks can perform through the lens 
of formal language theory. In so doing, we can 
also compare the notions of grammar expressible 
by neural networks to formal models that have 
been proposed for natural language grammar. 


2 Introducing the Asymptotic Analysis 


To investigate the capacities of different neural 
network architectures, we need to first define what 
it means for a neural network to accept a language. 
There are a variety of ways to formalize language 
acceptance, and changes to this definition lead to 
dramatically different characterizations. 

In their analysis of RNN expressiveness, Siegel- 
mann and Sontag (1992) allow RNNs to perform 
an unbounded number of recurrent steps even af- 
ter the input has been consumed. Furthermore, 
they assume that the hidden units of the network 
can have arbitrarily fine-grained precision. Un- 
der this very general definition of language accep- 
tance, Siegelmann and Sontag (1992) found that 
even a simple recurrent network (SRN) can simu- 
late a Turing machine. 

We want to impose the following constraints on 
neural network computation, which are more real- 
istic to how networks are trained in practice (Weiss 
et al., 2018): 


1. Real-time: The network performs one itera- 
tion of computation per input symbol. 


2. Bounded precision: The value of each cell in 
the network is representable by O(log n) bits 
on sequences of length n. 


Informally, a neural sequence acceptor is a net- 
work which reads a variable-length sequence of 
characters and returns the probability that the in- 
put sequence is a valid sentence in some formal 
language. More precisely, we can write: 


Definition 2.1 (Neural sequence acceptor). Let $X$
be a matrix representation of a sentence where
each row is a one-hot vector over an alphabet $\Sigma$.
A neural sequence acceptor $\hat{\mathbb{1}}$ is a family of functions
parameterized by weights $\theta$. For each $\theta$ and
$X$, the function $\hat{\mathbb{1}}^\theta$ takes the form

$$\hat{\mathbb{1}}^\theta : X \mapsto p \in (0, 1).$$

Figure 1: With sigmoid activations, the network on the
left accepts a sequence of bits if and only if $x_t = 1$ for
some $t$. On the right is the discrete computation graph
that the network approaches asymptotically.


In this definition, $\hat{\mathbb{1}}$ corresponds to a general architecture
like an LSTM, whereas $\hat{\mathbb{1}}^\theta$ represents a
specific network, such as an LSTM with weights
that have been learned from data.

In order to get an acceptance decision from 
this kind of network, we will consider what hap- 
pens as the magnitude of its parameters gets very 
large. Under these asymptotic conditions, the in- 
ternal connections of the network approach a dis- 
crete computation graph, and the probabilistic out- 
put approaches the indicator function of some lan- 
guage (Figure 1). 

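Definition 2.2 (Asymptotic acceptance). Let $L$ be
a language with indicator function $\mathbb{1}_L$. A neural
sequence acceptor $\hat{\mathbb{1}}$ with weights $\theta$ asymptotically
accepts $L$ if

$$\lim_{N \to \infty} \hat{\mathbb{1}}^{N\theta} = \mathbb{1}_L.$$

Note that the limit of $\hat{\mathbb{1}}^{N\theta}$ represents the function
that $\hat{\mathbb{1}}^{N\theta}$ converges to pointwise.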
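
As a rough illustration of asymptotic acceptance, the following Python sketch scales the weights of a toy acceptor by a growing factor N and compares its output with the indicator function of the language of bit strings that contain at least one 1. The acceptor `p = sigmoid(N * (sum(bits) - 0.5))` and its weights are assumptions chosen for illustration; they are not claimed to be the network of Figure 1.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def acceptor(bits, scale):
    """Toy acceptor with hypothetical weights, scaled by `scale` (the N in the limit)."""
    return sigmoid(scale * (sum(bits) - 0.5))


def indicator(bits):
    """Indicator function of the language 'contains at least one 1'."""
    return 1.0 if any(bits) else 0.0


strings = [[0, 0, 0], [0, 1, 0], [1, 1, 1], []]
for scale in (1, 10, 100):
    # Largest pointwise distance between the network output and the indicator.
    gap = max(abs(acceptor(s, scale) - indicator(s)) for s in strings)
    print(f"N = {scale:>3}: max gap from indicator = {gap:.3g}")
```

The printed gap shrinks as N grows, which is the pointwise convergence that Definition 2.2 asks for on this small set of strings.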

Discretizing the network in this way lets us an- 
alyze it as an automaton. We can also view this 
discretization as a way of bounding the precision 
that each unit in the network can encode, since it is 
forced to act as a discrete unit instead of a continu- 
ous value. This prevents complex fractal represen- 
tations that rely on infinite precision. We will see 
later that, for every architecture considered, this 
definition ensures that the value of every unit in 
the network is representable in O(log n) bits on 
sequences of length n. 

It is important to note that real neural networks 
can learn strategies not allowed by the asymptotic 
definition. Thus, this way of analyzing neural net- 
works is not completely faithful to their practical 
usage. In Section 6, we discuss empirical studies
investigating how trained networks compare to the 
asymptotic predictions. While we find evidence 
of networks learning behavior that is not asymp- 
totically stable, adding noise to the network dur- 
ing training seems to make it more difficult for the 
network to learn non-asymptotic strategies. 

Consider a neural network that asymptotically 
accepts some language. For any given length, we 
can pick weights for the network such that it will 
correctly decide strings shorter than that length 
(Theorem A.1). 

Analyzing a network’s asymptotic behavior also 
gives us a notion of the network’s memory. Weiss 
et al. (2018) illustrate how the LSTM’s additive 
cell update gives it more effective memory than 
the squashed state of an SRN or GRU for solv- 
ing counting tasks. We generalize this concept 
of memory capacity as state complexity. Infor- 
mally, the state complexity of a node within a net- 
work represents the number of values that the node 
can achieve asymptotically as a function of the sequence
length $n$. For example, the LSTM cell state
will have $O(n^k)$ state complexity (Theorem 3.3),
whereas the state of other recurrent networks has
$O(1)$ state complexity (Theorem 3.1).

State complexity applies to a hidden state se- 
quence, which we can define as follows: 

Definition 2.3 (Hidden state). For any sentence
$X$, let $n$ be the length of $X$. For $1 \le t \le n$, the
$k$-length hidden state $h_t$ with respect to parameters
$\theta$ is a sequence of functions given by

$$h_t^\theta : X \mapsto v_t \in \mathbb{R}^k.$$


Often, a sequence acceptor can be written as a 
function of an intermediate hidden state. For ex- 
ample, the output of the recurrent layer acts as a 
hidden state in an LSTM language acceptor. In re- 
current architectures, the value of the hidden state 
is a function of the preceding prefix of characters, 
but with convolution or attention, it can depend on 
characters occurring after index t. 

The state complexity is defined as the cardinal- 
ity of the configuration set of such a hidden state: 


Definition 2.4 (Configuration set). For all $n$, the
configuration set of hidden state $h_n$ with respect
to parameters $\theta$ is given by

$$M(h_n^\theta) = \left\{ \lim_{N \to \infty} h_n^{N\theta}(X) \;\middle|\; n = |X| \right\},$$

where $|X|$ is the length, or height, of the sentence
matrix $X$.


Definition 2.5 (Fixed state complexity). For all $n$,
the fixed state complexity of hidden state $h_n$ with
respect to parameters $\theta$ is given by

$$\mathfrak{m}(h_n^\theta) = \left| M(h_n^\theta) \right|.$$


Definition 2.6 (General state complexity). For all
$n$, the general state complexity of hidden state $h_n$
is given by

$$\mathfrak{m}(h_n) = \max_\theta \mathfrak{m}(h_n^\theta).$$


To illustrate these definitions, consider a sim- 
plified recurrent mechanism based on the LSTM 
cell. The architecture is parameterized by a vector
$\theta \in \mathbb{R}^2$. At each time step, the network reads a bit
$x_t$ and computes

$$f_t = \sigma(\theta_1 x_t) \quad (1)$$
$$i_t = \sigma(\theta_2 x_t) \quad (2)$$
$$h_t = f_t h_{t-1} + i_t. \quad (3)$$

When we set $\theta^+ = (1, 1)$, $h_t$ asymptotically
computes the sum of the preceding inputs. Because
this sum can evaluate to any integer between
0 and $n$, $h_n^{\theta^+}$ has a fixed state complexity of

$$\mathfrak{m}(h_n^{\theta^+}) = O(n). \quad (4)$$

However, when we use parameters $\theta^{\mathrm{Id}} = (-1, 1)$,
we get a reduced network where $h_t = x_t$ asymptotically.
Thus,

$$\mathfrak{m}(h_n^{\theta^{\mathrm{Id}}}) = O(1). \quad (5)$$

Finally, the general state complexity is the maximum
fixed complexity, which is $O(n)$.

For any neural network hidden state, the state
complexity is at most $2^{O(n)}$ (Theorem A.2). This
means that the value of the hidden unit can be
encoded in $O(n)$ bits. Moreover, for every specific
architecture considered, we observe that each
fixed-length state vector has at most $O(n^k)$ state
complexity, or, equivalently, can be represented in
$O(\log n)$ bits.

Architectures that have exponential state com- 
plexity, such as the transformer, do so by using 
a variable-length hidden state. State complexity 
generalizes naturally to a variable-length hidden 
state, with the only difference being that $h_t$ (Definition
2.3) becomes a sequence of variably sized
objects rather than a sequence of fixed-length vec- 
tors. 


Now, we consider what classes of languages 
different neural networks can accept asymptoti- 
cally. We also analyze different architectures in 
terms of state complexity. The theory that emerges 
from these tools enables better understanding of 
the computational processes underlying neural se- 
quence models. 


3 Recurrent Neural Networks 


As previously mentioned, RNNs are Turing- 
complete under an unconstrained definition of ac- 
ceptance (Siegelmann and Sontag, 1992). The 
classical reduction of a Turing machine to an RNN 
relies on two unrealistic assumptions about RNN 
computation (Weiss et al., 2018). First, the num- 
ber of recurrent computations must be unbounded 
in the length of the input, whereas, in practice, 
RNNs are almost always trained in a real-time 
fashion. Second, it relies heavily on infinite pre- 
cision of the network’s logits. We will see that 
the asymptotic analysis, which restricts computa- 
tion to be real-time and have bounded precision, 
severely narrows the class of formal languages that 
an RNN can accept. 


3.1 Simple Recurrent Networks 


The SRN, or Elman network, is the simplest type 
of RNN (Elman, 1990): 


Definition 3.1 (SRN layer).

$$h_t = \tanh(W x_t + U h_{t-1} + b). \quad (6)$$


A well-known problem with SRNs is that they 
struggle with long-distance dependencies. One ex- 
planation of this is the vanishing gradient problem, 
which motivated the development of more sophis- 
ticated architectures like the LSTM (Hochreiter 
and Schmidhuber, 1997). Another shortcoming of 
the SRN is that, in some sense, it has less mem- 
ory than the LSTM. This is because, while both 
architectures have a fixed number of hidden units, 
the SRN units remain between $-1$ and $1$, whereas
the value of each LSTM cell can grow unbound- 
edly (Weiss et al., 2018). We can formalize this 
intuition by showing that the SRN has finite state 
complexity: 


Theorem 3.1 (SRN state complexity). For any
length $n$, the SRN cell state $h_n \in \mathbb{R}^k$ has state
complexity

$$\mathfrak{m}(h_n) \le 2^k = O(1).$$


Proof. For every $n$, each unit of $h_n$ will be the
output of a tanh. In the limit, it can achieve either
$-1$ or $1$. Thus, for the full vector, the number of
configurations is bounded by $2^k$.
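
The bound in Theorem 3.1 can be checked numerically on a small example. The sketch below runs an SRN with k = 2 hidden units over every bit string of length n, with all parameters multiplied by a large factor to approximate the asymptotic limit, and counts the distinct final states. The weights are hypothetical values picked by hand so that no pre-activation is exactly zero, and the input is a scalar bit rather than a one-hot vector to keep the code short.

```python
import itertools
import math


def srn_final_state(bits, W, U, b, scale):
    """Run h_t = tanh(scale * (W x_t + U h_{t-1} + b)) over a bit string."""
    k = len(b)
    h = [0.0] * k
    for x in bits:
        h = [
            math.tanh(scale * (W[i] * x + sum(U[i][j] * h[j] for j in range(k)) + b[i]))
            for i in range(k)
        ]
    return tuple(h)


def configuration_count(n, W, U, b, scale=1e6):
    """Count distinct final states over all bit strings of length n."""
    states = {
        tuple(round(v, 6) for v in srn_final_state(bits, W, U, b, scale))
        for bits in itertools.product([0, 1], repeat=n)
    }
    return len(states)


# Hypothetical weights for k = 2 hidden units (illustration only).
k = 2
W = [1.0, -0.5]
U = [[0.5, -0.3], [0.2, 0.7]]
b = [-0.25, 0.1]

for n in range(1, 9):
    print(n, configuration_count(n, W, U, b), "<=", 2 ** k)
```

However long the input, the count never exceeds $2^k = 4$, because every saturated unit ends at either $-1$ or $1$.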


It also follows from Theorem 3.1 that the lan- 
guages asymptotically acceptable by an SRN are a 
subset of the finite-state (i.e. regular) languages. 
Lemma B.1 provides the other direction of this 
containment. Thus, SRNs are equivalent to finite- 
state automata. 


Theorem 3.2 (SRN characterization). Let 
L(SRN) denote the languages acceptable by an 
SRN, and RL the regular languages. Then, 


L(SRN) = RL. 


This characterization is quite diminished com- 
pared to Turing completeness. It is also more de- 
scriptive of what SRNs can express in practice. We 
will see that LSTMs, on the other hand, are strictly 
more powerful than the regular languages. 


3.2 Long Short-Term Memory Networks 


An LSTM is a recurrent network with a complex 
gating mechanism that determines how informa- 
tion from one time step is passed to the next. 
Originally, this gating mechanism was designed to 
remedy the vanishing gradient problem in SRNs, 
or, equivalently, to make it easier for the network 
to remember long-term dependencies (Hochreiter 
and Schmidhuber, 1997). Due to strong empiri- 
cal performance on many language tasks, LSTMs 
have become a canonical model for NLP. 

Weiss et al. (2018) suggest that another advan- 
tage of the LSTM architecture is that it can use 
its cell state as counter memory. They point out 
that this constitutes a real difference between the 
LSTM and the GRU, whose update equations do 
not allow it to increment or decrement its memory 
units. We will further investigate this connection 
between LSTMs and counter machines. 


Definition 3.2 (LSTM layer).

$$f_t = \sigma(W^f x_t + U^f h_{t-1} + b^f) \quad (7)$$
$$i_t = \sigma(W^i x_t + U^i h_{t-1} + b^i) \quad (8)$$
$$o_t = \sigma(W^o x_t + U^o h_{t-1} + b^o) \quad (9)$$
$$\tilde{c}_t = \tanh(W^c x_t + U^c h_{t-1} + b^c) \quad (10)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \quad (11)$$
$$h_t = o_t \odot f(c_t). \quad (12)$$


In (12), we set $f$ to either the identity or tanh
(Weiss et al., 2018), although tanh is more standard
in practice. The vector $h_t$ is the output that is
received by the next layer, and $c_t$ is an unexposed
memory vector called the cell state.


Theorem 3.3 (LSTM state complexity). The
LSTM cell state $c_n \in \mathbb{R}^k$ has state complexity

$$\mathfrak{m}(c_n) = O(n^k).$$


Proof. At each time step $t$, we know that the configuration
sets of $f_t$, $i_t$, and $o_t$ are each subsets of
$\{0, 1\}^k$. Similarly, the configuration set of $\tilde{c}_t$ is a
subset of $\{-1, 1\}^k$. This allows us to rewrite the
elementwise recurrent update as

$$\lim_{N \to \infty} [c_t]_i = \lim_{N \to \infty} [f_t]_i [c_{t-1}]_i + [i_t]_i [\tilde{c}_t]_i \quad (13)$$
$$= \lim_{N \to \infty} a [c_{t-1}]_i + b, \quad (14)$$

where $a \in \{0, 1\}$ and $b \in \{-1, 0, 1\}$.

Let $S_t$ be the configuration set of $[c_t]_i$. At each
time step, we have exactly two ways to produce a
new value in $S_t$ that was not in $S_{t-1}$: either we
decrement the minimum value in $S_{t-1}$ or increment
the maximum value. It follows that

$$|S_t| = 2 + |S_{t-1}| \quad (15)$$
$$\implies |S_n| = O(n). \quad (16)$$

For all $k$ units of the cell state, we get

$$\mathfrak{m}(c_n) \le |S_n|^k = O(n^k). \quad (17)$$
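
The counting argument in this proof can be replayed directly. The sketch below enumerates the values reachable by the asymptotic update $c_t = a c_{t-1} + b$ with $a \in \{0, 1\}$ and $b \in \{-1, 0, 1\}$ for a single cell unit. The initial value $c_0 = 0$ is an assumption made here for illustration.

```python
def reachable_cell_values(n):
    """Return the sizes |S_1|, ..., |S_n| of the sets of values reachable by
    the asymptotic update c_t = a * c_{t-1} + b for one cell unit."""
    reachable = {0}  # assumed initial cell value c_0 = 0
    sizes = []
    for _ in range(n):
        reachable = {
            a * c + b
            for c in reachable
            for a in (0, 1)
            for b in (-1, 0, 1)
        }
        sizes.append(len(reachable))
    return sizes


print(reachable_cell_values(6))  # [3, 5, 7, 9, 11, 13]
```

Each step adds exactly two new values (one below the minimum and one above the maximum), so a single unit has $2n + 1$ configurations after $n$ steps and $k$ independent units have at most $(2n + 1)^k = O(n^k)$.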


The construction in Theorem 3.3 produces a 
counter machine whose counter and state update 
functions are linearly separable. Thus, we have 
an upper bound on the expressive power of the 
LSTM: 


Theorem 3.4 (LSTM upper bound). Let CL be the 
real-time counter languages (Fischer, 1966; Fis- 
cher et al., 1968). Then, 


$$L(\mathrm{LSTM}) \subseteq \mathrm{CL}.$$


Theorem 3.4 constitutes a very tight upper 
bound on the expressiveness of LSTM computa- 
tion. Asymptotically, LSTMs are not powerful 
enough to model even the deterministic context-free
language $w\#w^R$.

Weiss et al. (2018) show how the LSTM can 
simulate a simplified variant of the counter machine.
Combining these results, we see that
the asymptotic expressiveness of the LSTM falls
somewhere between the general and simplified 
counter languages. This suggests counting is a 
good way to understand the behavior of LSTMs. 
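
To make the counting view concrete, here is a minimal sketch of a real-time acceptor with a single counter for the language $a^n b^n$ ($n \ge 1$). It reads one symbol per step and only increments, decrements, and tests its counter. The function name and the explicit `seen_b` flag (a piece of finite state) are choices made for this illustration and are not taken from Weiss et al. (2018).

```python
def accepts_anbn(string):
    """Real-time one-counter acceptor for { a^n b^n : n >= 1 }."""
    counter = 0
    seen_b = False  # finite-state flag: have we started reading b's?
    for symbol in string:
        if symbol == "a":
            if seen_b:  # an 'a' after a 'b' can never be repaired
                return False
            counter += 1
        elif symbol == "b":
            seen_b = True
            counter -= 1
            if counter < 0:  # more b's than a's so far
                return False
        else:
            return False  # symbol outside the alphabet {a, b}
    return seen_b and counter == 0


for s in ["ab", "aaabbb", "aab", "abab", ""]:
    print(repr(s), accepts_anbn(s))
```

The counter takes $O(n)$ distinct values on inputs of length $n$, which is the kind of memory that a finite-state device lacks.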


3.3 Gated Recurrent Units 


The GRU is a popular gated recurrent architecture 
that is in many ways similar to the LSTM (Cho 
et al., 2014). Rather than having separate forget 
and input gates, the GRU utilizes a single gate that 
controls both functions. 


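Definition 3.3 (GRU layer).

$$z_t = \sigma(W^z x_t + U^z h_{t-1} + b^z) \quad (18)$$
$$r_t = \sigma(W^r x_t + U^r h_{t-1} + b^r) \quad (19)$$
$$u_t = \tanh\big(W^u x_t + U^u (r_t \odot h_{t-1}) + b^u\big) \quad (20)$$
$$h_t = z_t \odot h_{t-1} + (1 - z_t) \odot u_t. \quad (21)$$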


Weiss et al. (2018) observe that GRUs do not 
exhibit the same counter behavior as LSTMs on 
languages like $a^n b^n$. As with the SRN, the GRU
state is squashed between $-1$ and $1$ (20). Taken
together, Lemmas C.1 and C.2 show that GRUs, 
like SRNs, are finite-state. 
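
The squashing of the GRU state can be observed in a short simulation. The sketch below runs a scalar GRU with random, hypothetical parameters over a random bit sequence and tracks the largest magnitude the state reaches. Because $h_t$ is a convex combination of $h_{t-1}$ and a tanh output, it stays within $[-1, 1]$ whenever $h_0$ does.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def gru_step(h_prev, x, p):
    """One scalar GRU update following equations (18)-(21)."""
    z = sigmoid(p["wz"] * x + p["uz"] * h_prev + p["bz"])
    r = sigmoid(p["wr"] * x + p["ur"] * h_prev + p["br"])
    u = math.tanh(p["wu"] * x + p["uu"] * (r * h_prev) + p["bu"])
    return z * h_prev + (1.0 - z) * u


random.seed(0)
names = ["wz", "uz", "bz", "wr", "ur", "br", "wu", "uu", "bu"]
params = {name: random.uniform(-5, 5) for name in names}  # hypothetical weights

h = 0.0
largest = 0.0
for _ in range(1000):
    h = gru_step(h, random.choice([0, 1]), params)
    largest = max(largest, abs(h))

print(f"largest |h_t| observed: {largest:.6f}")
```

Unlike the LSTM cell state, nothing in this update lets the magnitude of the state grow with the sequence length.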


Theorem 3.5 (GRU characterization). 
L(GRU) = RL. 


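3.4 RNN Complexity Hierarchy


Synthesizing all of these results, we get the following
complexity hierarchy:

$$\mathrm{RL} = L(\mathrm{SRN}) = L(\mathrm{GRU}) \quad (22)$$
$$\subset \mathrm{SCL} \subseteq L(\mathrm{LSTM}) \subseteq \mathrm{CL}, \quad (23)$$

where SCL denotes the simplified counter languages.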


Basic recurrent architectures have finite state, 
whereas the LSTM is strictly more powerful than 
a finite-state machine. 


4 Attention 


Attention is a popular enhancement to sequence- 
to-sequence (seq2seq) neural networks (Bahdanau 
et al., 2014; Chorowski et al., 2015; Luong et al., 
2015). Attention allows a network to recall spe- 
cific encoder states while trying to produce output. 
In the context of machine translation, this mecha- 
nism models the alignment between words in the 
source and target languages. More recent work 
has found that “attention is all you need” (Vaswani 
et al., 2017; Radford et al., 2018). In other words,
networks with only attention and no recurrent connections
perform at the state of the art on many
tasks. 

An attention function maps a query vector and a
sequence of paired key-value vectors to a weighted
combination of the values. This lookup function is 
meant to retrieve the values whose keys resemble 
the query.

Definition 4.1 (Dot-product attention). For any $n$,
define a query vector $q \in \mathbb{R}^l$, matrix of key vectors
$K \in \mathbb{R}^{n \times l}$, and matrix of value vectors $V \in \mathbb{R}^{n \times k}$.
Dot-product attention is given by

$$\mathrm{attn}(q, K, V) = \mathrm{softmax}(q K^T) V.$$


In Definition 4.1, softmax creates a vector of 
similarity scores between the query q and the key 
vectors in K. The output of attention is thus 
a weighted sum of the value vectors where the 
weight for each value represents its relevance. 

In practice, the dot product $q K^T$ is often scaled
by the square root of the length of the query vector 
(Vaswani et al., 2017). However, this is only done 
to improve optimization and has no effect on ex- 
pressiveness. Therefore, we consider the unscaled 
version. 

In the asymptotic case, attention reduces to a 
weighted average of the values whose keys maxi- 
mally resemble the query. This can be viewed as 
an arg max operation. 


Theorem 4.1 (Asymptotic attention). Let $t_1, .., t_m$
be the subsequence of time steps that maximize
$q k_t$.² Asymptotically, attention computes

$$\lim_{N \to \infty} \mathrm{attn}(Nq, K, V) = \lim_{N \to \infty} \frac{1}{m} \sum_{i=1}^{m} v_{t_i}.$$

Corollary 4.1.1 (Asymptotic attention with
unique maximum). If $q k_t$ has a unique maximum
over $1 \le t \le n$, then attention asymptotically
computes

$$\lim_{N \to \infty} \mathrm{attn}(Nq, K, V) = \lim_{N \to \infty} \underset{v_t}{\arg\max}\; q k_t.$$
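
The asymptotic behavior of attention can be seen numerically by scaling the query. The sketch below implements dot-product attention for a single query in plain Python and evaluates it on a small, made-up example in which two keys tie for the maximum score. As the scale grows, the output approaches the average of the two corresponding value vectors. All vectors here are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attn(q, K, V):
    """Dot-product attention softmax(q K^T) V for a single query vector q."""
    scores = [sum(qi * ki for qi, ki in zip(q, key)) for key in K]
    weights = softmax(scores)
    dim = len(V[0])
    return [sum(w * v[j] for w, v in zip(weights, V)) for j in range(dim)]


q = [1.0, 0.0]
K = [[2.0, 0.0], [1.0, 5.0], [2.0, -3.0]]  # rows 0 and 2 tie for the maximum score
V = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0]]

for scale in (1, 10, 100):
    scaled_q = [scale * qi for qi in q]
    print(scale, attn(scaled_q, K, V))
```

In the limit the output is the mean of `V[0]` and `V[2]`, namely `[2.0, 0.0]`. With a unique maximizer, the same computation would return that single value vector, as in Corollary 4.1.1.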


Now, we analyze the effect of adding attention 
to an acceptor network. Because we are concerned 
with language acceptance instead of transduction, 
we consider a simplified seq2seq attention model 
where the output sequence has length 1: 


² To be precise, we can define a maximum over the similarity
scores according to the order given by

$$f > g \iff \lim_{N \to \infty} f(N) - g(N) > 0. \quad (24)$$


Definition 4.2 (Attention layer). Let the hidden
state $v_1, .., v_n$ be the output of an encoder network
where the union of the asymptotic configuration
sets over all $v_t$ is finite. We attend over $V_t$, the
matrix stacking $v_1, .., v_t$, by computing

$$h_t = \mathrm{attn}(W^q v_t, V_t, V_t).$$


In this model, $h_t$ represents a summary of the
relevant information in the prefix $v_1, .., v_t$. The
query that is used to attend at time $t$ is a simple
linear transformation of $v_t$.

In addition to modeling alignment, attention improves
a bounded-state model by providing additional
memory. By converting the state of the
network to a growing sequence $V_t$ instead of a
fixed length vector $v_t$, attention enables $2^{\Theta(n)}$
state complexity.


Theorem 4.2 (Encoder state complexity). The full
state of the attention layer has state complexity

$$\mathfrak{m}(V_n) = 2^{\Theta(n)}.$$


The $O(n^k)$ complexity of the LSTM architecture
means that it is impossible for LSTMs to
copy or reverse long strings. The exponential state 
complexity provided by attention enables copying, 
which we can view as a simplified version of ma- 
chine translation. Thus, it makes sense that atten- 
tion is almost universal in machine translation ar- 
chitectures. The additional memory introduced by 
attention might also allow more complex hierar- 
chical representations. 

A natural follow-up question to Theorem 4.2 is
whether this additional complexity is preserved in
the attention summary vector $h_n$. Attending over
$V_n$ does not preserve exponential state complexity.
Instead, we get an $O(n^2)$ summary of $V_n$.


Theorem 4.3 (Summary state complexity). The 
attention summary vector has state complexity 


With minimal additional assumptions, we can 
show a more restrictive bound: namely, that the 
complexity of the summary vector is finite. Ap- 
pendix D discusses this in more detail. 


5 Convolutional Networks 


While CNNs were originally developed for image 
processing (Krizhevsky et al., 2012), they are also
used to encode sequences. One popular application
of this is to build character-level representations
of words (Kim et al., 2016). Another example
is the capsule network architecture of Zhao
et al. (2018), which uses a convolutional layer as 
an initial feature extractor over a sentence. 


Definition 5.1 (CNN acceptor).

$$h_t = \tanh\big(W^h (x_{t-k} \| .. \| x_{t+k}) + b^h\big) \quad (25)$$
$$h_+ = \mathrm{maxpool}(H) \quad (26)$$
$$p = \sigma(W^a h_+ + b^a). \quad (27)$$


In this network, the k-convolutional layer (25) 
produces a vector-valued sequence of outputs. 
This sequence is then collapsed to a fixed length 
by taking the maximum value of each filter over 
all the time steps (26). 

The CNN acceptor is much weaker than the 
LSTM. Since the vector $h_t$ has finite state, we
see that $L(\mathrm{CNN}) \subseteq \mathrm{RL}$. Moreover, simple regular
languages like $a^*ba^*$ are beyond the CNN
(Lemma E.1). Thus, the subset relation is strict. 


Theorem 5.1 (CNN upper bound). 

$$L(\mathrm{CNN}) \subset \mathrm{RL}.$$


So, to arrive at a characterization of CNNs, we 
should move to subregular languages. In par- 
ticular, we consider the strictly local languages 
(Rogers and Pullum, 2011). 


Theorem 5.2 (CNN lower bound). Let SL be the 
strictly local languages. Then, 


$$\mathrm{SL} \subseteq L(\mathrm{CNN}).$$
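
The intuition behind this lower bound is that a strictly local language is defined by a finite set of forbidden windows, and a convolution followed by max-pooling is a natural way to detect whether any window fires. The sketch below is a discrete analogue of that idea, not the construction used in the proof: each position gets a 0/1 detection, and the maximum over positions decides acceptance. The boundary markers `<` and `>` and the example language are assumptions made for illustration.

```python
def cnn_style_sl_acceptor(string, forbidden, width):
    """Accept iff no length-`width` window of the padded string is forbidden.

    `<` and `>` are assumed boundary markers that do not occur in the alphabet.
    """
    padded = "<" * (width - 1) + string + ">" * (width - 1)
    windows = [padded[i:i + width] for i in range(len(padded) - width + 1)]
    # "Convolution": one detector per position, firing on a forbidden window.
    detections = [1 if w in forbidden else 0 for w in windows]
    # "Max-pooling": collapse the detections over all time steps.
    pooled = max(detections, default=0)
    return pooled == 0


# Hypothetical strictly 2-local language: strings over {a, b} without the factor "bb".
forbidden = {"bb"}
for s in ["abab", "abba", "", "b"]:
    print(repr(s), cnn_style_sl_acceptor(s, forbidden, width=2))
```

A language such as $a^*ba^*$ does not fit this pattern, since checking for exactly one b requires relating positions that are arbitrarily far apart.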


Notably, strictly local formalisms have been 
proposed as a computational model for phonolog- 
ical grammar (Heinz et al., 2011). We might take 
this to explain why CNNs have been successful at 
modeling character-level information. 

However, Heinz et al. (2011) suggest that a gen- 
eralization to the tier-based strictly local languages 
is necessary to account for the full range of phono- 
logical phenomena. Tier-based strictly local gram- 
mars can target characters in a specific tier of the 
vocabulary (e.g. vowels) instead of applying to 
the full string. While a single convolutional layer 
cannot utilize tiers, it is conceivable that a more 
complex architecture with recurrent connections 
could. 


6 Empirical Results 


In this section, we compare our theoretical charac- 
terizations for asymptotic networks to the empiri- 
cal performance of trained neural networks with 
continuous logits.³


6.1 Counting 


The goal of this experiment is to evaluate which 
architectures have memory beyond finite state. We 
train a language model on $a^n b^n c$ with $5 \le n \le 1000$
and test it on longer strings ($2000 \le n \le 2200$).
Predicting the c character correctly while
maintaining good overall accuracy requires O(n) 
states. The results reported in Table 1 demonstrate 
that all recurrent models, with only two hidden 
units, find a solution to this task that generalizes 
at least over this range of string lengths. 

Weiss et al. (2018) report failures in attempts 
to train SRNs and GRUs to accept counter lan- 
guages, unlike what we have found. We conjecture 
that this stems not from the requisite memory, but 
instead from the different objective function we 
used. Our language modeling training objective is 
a robust and transferable learning target (Radford 
et al., 2019), whereas sparse acceptance classifica- 
tion might be challenging to learn directly for long 
strings. 

Weiss et al. (2018) also observe that LSTMs 
use their memory as counters in a straightfor- 
wardly interpretable manner, whereas SRNs and 
GRUs do not do so in any obvious way. De- 
spite this, our results show that SRNs and GRUs 
are nonetheless able to implement generalizable 
counter memory while processing strings of sig- 
nificant length. Because the strategies learned by 
these architectures are not asymptotically stable, 
however, their schemes for encoding counting are 
less interpretable. 


6.2 Counting with Noise 


In order to abstract away from asymptotically un- 
stable representations, our next experiment inves- 
tigates how adding noise to an RNN’s activations 
impacts its ability to count. For the SRN and GRU, 
noise is added to $h_{t-1}$ before computing $h_t$, and
for the LSTM, noise is added to $c_{t-1}$. In either
case, the noise is sampled from the distribution
$\mathcal{N}(0, 0.1^2)$.
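
As a minimal sketch of this noise injection, assuming a scalar SRN for brevity, the function below perturbs the previous state with Gaussian noise of standard deviation 0.1 before applying the update. The function name, signature, and weights are hypothetical and are not taken from the experimental code.

```python
import math
import random


def noisy_srn_step(h_prev, x, w, u, b, sigma=0.1, rng=random):
    """One scalar SRN step with noise drawn from N(0, sigma^2) added to h_{t-1}."""
    h_noisy = h_prev + rng.gauss(0.0, sigma)
    return math.tanh(w * x + u * h_noisy + b)


rng = random.Random(0)
h = 0.0
for x in [1, 1, 0, 1]:
    h = noisy_srn_step(h, x, w=1.5, u=0.8, b=-0.2, rng=rng)  # hypothetical weights
print(h)
```

A strategy that stores a count in fine-grained differences of a squashed state is fragile under this perturbation, whereas an integer-valued counter is changed only slightly by noise of this size.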


³ https://github.com/viking-sudo-rm/

       | State      | No Noise         | Noise
       | complexity | Acc   | Acc on c | Acc   | Acc on c
SRN    | O(1)       | 100.0 | 100.0    |  49.9 | 100.0
GRU    | O(1)       |  99.9 | 100.0    |  53.9 | 100.0
LSTM   | O(n^k)     |  99.9 | 100.0    |  99.9 | 100.0


Table 1: Generalization performance of language models trained on $a^n b^n c$. Each model has 2 hidden units.


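            | State complexity | Val Acc | Gen Acc
LSTM        | O(n^k)           |  94.0   |  51.6
LSTM-Attn   | 2^{Θ(n)}         | 100.0   |  51.7
------------+------------------+---------+--------
LSTM        | O(n^k)           |  92.5   |  73.3
StackNN     | 2^{Θ(n)}         | 100.0   | 100.0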

Table 2: Max validation and generalization accuracies 
on string reversal over 10 trials. The top section shows 
our seq2seq LSTM with and without attention. The 
bottom reports the LSTM and StackNN results of Hao 
et al. (2018). Each LSTM has 10 hidden units. 


The results reported in the right column of Ta- 
ble 1 show that the noisy SRN and GRU now fail 
to count, whereas the noisy LSTM remains suc- 
cessful. Thus, the asymptotic characterization of 
each architecture matches the capacity of a trained 
network when a small amount of noise is intro- 
duced. 

From a practical perspective, training neural 
networks with Gaussian noise is one way of im- 
proving generalization by preventing overfitting 
(Bishop, 1995; Noh et al., 2017). From this point 
of view, asymptotic characterizations might be 
more descriptive of the generalization capacities 
of regularized neural networks of the sort neces- 
sary to learn the patterns in natural language data 
as opposed to the unregularized networks that are 
typically used to learn the patterns in carefully cu- 
rated formal languages. 


6.3 Reversing 


Another important formal language task for assessing
network memory is string reversal. Reversing
requires remembering an $O(n)$ prefix of
characters, which implies $2^{\Theta(n)}$ state complexity.

We frame reversing as a seq2seq transduction 
task, and compare the performance of an LSTM 
encoder-decoder architecture to the same architec- 
ture augmented with attention. We also report the 
results of Hao et al. (2018) for a stack neural network
(StackNN), another architecture with $2^{\Theta(n)}$
state complexity (Lemma F.1).

Following Hao et al. (2018), the models were
trained on 800 random binary strings with length
$\sim \mathcal{N}(10, 2)$ and evaluated on strings with length
$\sim \mathcal{N}(50, 5)$. As can be seen in Table 2, the LSTM
with attention achieves 100.0% validation accuracy,
but fails to generalize to longer strings. In
contrast, Hao et al. (2018) report that a stack neural
network can learn and generalize string reversal
flawlessly. In both cases, it seems that having
$2^{\Theta(n)}$ state complexity enables better performance
on this memory-demanding task. However, our
seq2seq LSTMs appear to be biased against find- 
ing a strategy that generalizes to longer strings. 


7 Conclusion 


We have introduced asymptotic acceptance as a 
new way to characterize neural networks as au- 
tomata of different sorts. It provides a useful and 
generalizable tool for building intuition about how 
a network works, as well as for comparing the 
formal properties of different architectures. Fur- 
ther, by combining asymptotic characterizations 
with existing results in mathematical linguistics, 
we can better assess the suitability of different ar- 
chitectures for the representation of natural lan- 
guage grammar. 

We observe empirically, however, that this dis- 
crete analysis fails to fully characterize the range 
of behaviors expressible by neural networks. In 
particular, RNNs predicted to be finite-state solve 
a task that requires more than finite memory. On 
the other hand, introducing a small amount of 
noise into a network’s activations seems to pre- 
vent it from implementing non-asymptotic strate- 
gies. Thus, asymptotic characterizations might be 
a good model for the types of generalizable strate- 
gies that noise-regularized neural networks trained 
on natural language data can learn. 


A Asymptotic Acceptance and State Complexity


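Theorem A.1 (Arbitrary approximation). Let $\hat{\mathbb{1}}$ be
a neural sequence acceptor for $L$. For all $m$,
there exist parameters $\theta_m$ such that, for any string
$x_1, .., x_n$ with $n < m$,

$$\left[ \hat{\mathbb{1}}^{\theta_m}(X) \right] = \mathbb{1}_L(X),$$

where $[\cdot]$ rounds to the nearest integer.


Proof. Consider a string $X$. By the definition of
asymptotic acceptance, there exists some number
$M_X$ which is the smallest number such that, for all
$N \ge M_X$,

$$\left[ \hat{\mathbb{1}}^{N\theta}(X) \right] = \mathbb{1}_L(X).$$

Now, let $\mathcal{X}_m$ be the set of sentences $X$ with length
less than $m$. Since $\mathcal{X}_m$ is finite, we pick $\theta_m$ just
by taking

$$\theta_m = \max_{X \in \mathcal{X}_m} M_X \theta. \quad (30)$$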

Theorem A.2 (General bound on state complexity).
Let $h_t$ be a neural network hidden state. For
any length $n$, it holds that


$$\mathfrak{m}(h_n) = 2^{O(n)}.$$


Proof. The number of configurations of $h_n$ cannot
be more than the number of distinct inputs to
the network. By construction, each $x_t$ is a one-hot
vector over the alphabet $\Sigma$, so there are at most
$|\Sigma|^n$ distinct input strings of length $n$. Thus, the
state complexity is bounded according to


$$\mathfrak{m}(h_n) \leq |\Sigma|^n = 2^{O(n)}.$$
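The counting argument can be checked by brute force on small cases. The following is a minimal sketch, not part of the original analysis: the two update functions `remember_all` and `parity_of_a` are hypothetical stand-ins for a saturated recurrent update, chosen only to show a worst case and a finite-state case.

```python
from itertools import product


def count_configurations(step, h0, alphabet, n):
    """Count distinct final states reached over all strings of length n."""
    states = set()
    for string in product(alphabet, repeat=n):
        h = h0
        for symbol in string:
            h = step(h, symbol)
        states.add(h)
    return len(states)


def remember_all(h, symbol):
    # Hypothetical worst case: the state stores the entire input so far.
    return h + (symbol,)


def parity_of_a(h, symbol):
    # Hypothetical finite-state case: the state tracks the parity of "a"s.
    return h ^ (symbol == "a")


alphabet = ("a", "b")
for n in range(1, 6):
    worst = count_configurations(remember_all, (), alphabet, n)
    small = count_configurations(parity_of_a, False, alphabet, n)
    print(n, worst, small, len(alphabet) ** n)
```

In this sketch the first update attains the bound $|\Sigma|^n$ exactly, while the second stays at two configurations for every length.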


B SRN Lemmas


Lemma B.1 (SRN lower bound).


$$\mathrm{RL} \subseteq L(\mathrm{SRN}).$$


Proof. We must show that any language acceptable
by a finite-state machine is SRN-acceptable.
We need to asymptotically compute a representation
of the machine’s state in $h_t$. We do this by
storing all values of the following finite predicate
at each time step:


$$\delta_t(i, \alpha) \iff q_{t-1}(i) \wedge x_t = \alpha \tag{31}$$


where $q_t(i)$ is true if the machine is in state $i$ at
time $t$.

Let $F$ be the set of accepting states for the machine,
and let $\delta^{-1}$ be the inverse transition relation.
Assuming $h_t$ asymptotically computes $\delta_t$, we can
decide to accept or reject in the final layer according
to the linearly separable disjunction


$$a_t \iff \bigvee_{i \in F} \bigvee_{(j, \alpha) \in \delta^{-1}(i)} \delta_t(j, \alpha). \tag{32}$$


We now show how to recurrently compute $\delta_t$ at
each time step. By rewriting $q_{t-1}$ in terms of the
previous $\delta_{t-1}$ values, we get the following recurrence:


$$\delta_t(i, \alpha) \iff x_t = \alpha \wedge \bigvee_{(j, \beta) \in \delta^{-1}(i)} \delta_{t-1}(j, \beta). \tag{33}$$

Since this formula is linearly separable, we can
compute it in a single neural network layer from
$x_t$ and $h_{t-1}$.

Finally, we consider the base case. We need to
ensure that transitions out of the initial state work
out correctly at the first time step. We do this by
adding a new memory unit $f_t$ to $h_t$ which is always
rewritten to have value 1. Thus, if $f_{t-1} = 0$,
we can be sure we are in the initial time step.
For each transition out of the initial state, we add
$f_{t-1} = 0$ as an additional term to get


$$\delta_t(0, \alpha) \iff x_t = \alpha \wedge \left( f_{t-1} = 0 \vee \bigvee_{(j, \beta) \in \delta^{-1}(0)} \delta_{t-1}(j, \beta) \right). \tag{34}$$


This equation is still linearly separable and guarantees
that the initial step will be computed correctly.
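The construction can be simulated directly. The sketch below rests on some assumptions: hard-threshold units stand in for sigmoids whose weights are scaled by $N \to \infty$, the names `make_srn_acceptor` and `run_dfa` are hypothetical, and the empty string is handled separately because the construction only reads out acceptance after at least one step.

```python
def make_srn_acceptor(transitions, accepting, initial=0):
    """Build an acceptor from the delta predicates using hard-threshold units.

    transitions maps (state, symbol) -> next state.
    """
    keys = sorted(transitions)  # one hidden unit per (state, symbol) pair
    inverse = {}                # state i -> pairs (j, beta) leading into i
    for (j, beta), i in transitions.items():
        inverse.setdefault(i, []).append((j, beta))
    big = len(keys) + 1         # weight on the one-hot input term

    def accepts(string):
        if not string:
            # Assumption of this sketch: decide the empty string directly.
            return initial in accepting
        delta = {key: 0 for key in keys}  # h_0 = 0
        f = 0                             # flag unit, 0 only before step 1
        for symbol in string:
            new_delta = {}
            for (i, alpha) in keys:
                x = 1 if symbol == alpha else 0
                disjunction = sum(delta[p] for p in inverse.get(i, []))
                if i == initial:
                    disjunction += 1 - f  # base-case term f_{t-1} = 0
                # Linearly separable AND of x with an OR over previous units.
                pre_activation = big * x + disjunction - big
                new_delta[(i, alpha)] = 1 if pre_activation > 0 else 0
            delta, f = new_delta, 1
        # Final layer: OR over transitions that land in an accepting state.
        return sum(delta[p] for i in accepting for p in inverse.get(i, [])) > 0

    return accepts


def run_dfa(transitions, accepting, string, initial=0):
    """Reference simulation of the same machine, for comparison."""
    state = initial
    for symbol in string:
        state = transitions[(state, symbol)]
    return state in accepting


# Example machine for a*ba*: state 1 means exactly one b has been read.
one_b = {(0, "a"): 0, (0, "b"): 1, (1, "a"): 1,
         (1, "b"): 2, (2, "a"): 2, (2, "b"): 2}
accepts_one_b = make_srn_acceptor(one_b, {1})
for s in ["b", "ab", "abb", "aabaa"]:
    print(s, accepts_one_b(s), run_dfa(one_b, {1}, s))
```

Each hidden unit here is a single thresholded linear function of the current symbol and the previous units, which is the sense in which the recurrence is linearly separable.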


C GRU Lemmas 


These results follow similar arguments to those in 
Subsection 3.1 and Appendix B. 


Lemma C.1 (GRU state complexity). The GRU
hidden state has state complexity


$$\mathfrak{m}(h_n) = O(1).$$


Proof. The configuration set of $z_t$ is a subset of
$\{0, 1\}^k$. Thus, we have two possibilities for each
value of $[h_t]_i$: either $[h_{t-1}]_i$ or $[u_t]_i$. Furthermore,
the configuration set of $[u_t]_i$ is a subset of $\{-1, 1\}$.
Let $S_t$ be the configuration set of $[h_t]_i$. We can
describe $S_t$ according to


$$S_0 = \{0\} \tag{35}$$
$$S_t \subseteq S_{t-1} \cup \{-1, 1\}. \tag{36}$$

This implies that, at most, there are only three possible
values for each logit: $-1$, $0$, or $1$. Thus, the
state complexity of $h_n$ is


$$\mathfrak{m}(h_n) \leq 3^k = O(1). \tag{37}$$
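The recurrence on $S_t$ can be unrolled mechanically. The following is a small sketch assuming the standard GRU interpolation $h_t = z_t \odot h_{t-1} + (1 - z_t) \odot u_t$ with saturated gate values; the function name `reachable_values` is hypothetical.

```python
from itertools import product


def reachable_values(n):
    """Values a single saturated GRU unit can hold after n steps."""
    values = {0}  # S_0 = {0}
    for _ in range(n):
        # Saturated update gate z in {0, 1}, saturated candidate u in {-1, 1}.
        values = {z * h + (1 - z) * u
                  for h, z, u in product(values, (0, 1), (-1, 1))}
    return values


for n in range(4):
    print(n, sorted(reachable_values(n)))
```

Under these assumptions the set stabilizes at three values after one step, so a hidden state with $k$ units has at most $3^k$ configurations regardless of $n$.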


Lemma C.2 (GRU lower bound).


$$\mathrm{RL} \subseteq L(\mathrm{GRU}).$$


Proof. We can simulate a finite-state machine using
the $\delta$ construction from Theorem 3.2. We
compute values for the following predicate at each
time step:


$$\delta_t(i, \alpha) \iff x_t = \alpha \wedge \bigvee_{(j, \beta) \in \delta^{-1}(i)} \delta_{t-1}(j, \beta). \tag{38}$$

Since (38) is linearly separable, we can store $\delta_t$
in our hidden state $h_t$ and recurrently compute its
update. The base case can be handled similarly to
(34). A final feedforward layer accepts or rejects
according to (32).


D Attention Lemmas 


Theorem D.1 (Theorem 4.1 restated). Let
$t_1, .., t_m$ be the subsequence of time steps that
maximize $q k_t$. Asymptotically, attention computes


$$\lim_{N \to \infty} \mathrm{attn}(q, K, V) = \lim_{N \to \infty} \frac{1}{m} \sum_{i=1}^{m} v_{t_i}.$$


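Proof. Observe that, asymptotically, softmax($u$)
approaches a function


$$\lim_{N \to \infty} \mathrm{softmax}(Nu)_t = \begin{cases} \frac{1}{m} & \text{if } u_t = \max(u) \\ 0 & \text{otherwise,} \end{cases} \tag{39}$$

where $m$ is the number of entries of $u$ that attain the maximum.
Thus, the output of the attention mechanism reduces
to the sum


$$\lim_{N \to \infty} \sum_{i=1}^{m} \frac{1}{m} v_{t_i}. \tag{40}$$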
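The limiting behavior of softmax is easy to observe numerically. This is an illustrative sketch only: the scores and values are made up, and `saturated_attention` is a hypothetical helper for one-dimensional values, not the attention layer defined in the paper.

```python
import math


def softmax(u):
    """Numerically stable softmax over a list of floats."""
    shift = max(u)
    exps = [math.exp(x - shift) for x in u]
    total = sum(exps)
    return [e / total for e in exps]


def saturated_attention(scores, values, scale):
    """Attention output when the scores are multiplied by a large scale N."""
    weights = softmax([scale * s for s in scores])
    return sum(w * v for w, v in zip(weights, values))


# Two time steps (indices 1 and 3) tie for the maximum score.
scores = [0.3, 1.0, -2.0, 1.0]
values = [5.0, 2.0, 7.0, 4.0]
for scale in (1, 10, 100, 1000):
    print(scale, saturated_attention(scores, values, scale))
```

As the scale grows, the output approaches the mean of the values at the tied maximizing positions, here $(2 + 4)/2 = 3$.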


Lemma D.1 (Theorem 4.2 restated). The full state
of the attention layer has state complexity


$$\mathfrak{m}(V_n) = 2^{\Theta(n)}.$$


Proof. By the general upper bound on state complexity
(Theorem A.2), we know that $\mathfrak{m}(V_n) = 2^{O(n)}$.
We now show the lower bound.

We pick weights $\theta$ in the encoder such that $v_t = x_t$.
Thus, $\mathfrak{m}(v_t^\theta) = |\Sigma|$ for all $t$. Since the values
at each time step are independent, we know that


$$\mathfrak{m}(V_n^\theta) = |\Sigma|^n \tag{41}$$
$$\therefore \mathfrak{m}(V_n) = 2^{\Omega(n)}. \tag{42}$$


Lemma D.2 (Theorem 4.3 restated). The attention
summary vector has state complexity


$$\mathfrak{m}(h_n) = O(n^2).$$


Proof. By Theorem 4.1, we know that


$$\lim_{N \to \infty} h_n = \lim_{N \to \infty} \frac{1}{m} \sum_{i=1}^{m} v_{t_i}. \tag{43}$$


By construction, there is a finite set $S$ containing
all possible configurations of every $v_t$. We bound
the number of configurations for each $v_{t_i}$ by $|S|$ to
get


$$\mathfrak{m}(h_n) \leq \sum_{m=1}^{n} |S| m \leq |S| n^2 = O(n^2). \tag{44}$$


Lemma D.3 (Attention state complexity lower
bound). The attention summary vector has state
complexity


$$\mathfrak{m}(h_n) = \Omega(n).$$


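Proof. Consider the case where keys and values
have dimension 1. Further, let the input strings
come from a binary alphabet $\Sigma = \{0, 1\}$. We pick
parameters $\theta$ in the encoder such that, for all $t$,


$$\lim_{N \to \infty} v_t = \begin{cases} 0 & \text{if } x_t = 0 \\ 1 & \text{otherwise} \end{cases} \tag{45}$$


and $\lim_{N \to \infty} k_t = 1$. Since all keys tie, every time step
maximizes $q k_t$, and attention returns


$$\lim_{N \to \infty} h_n = \frac{l}{n} \tag{46}$$


where $l$ is the number of $t$ such that $x_t = 1$. We
can vary the input to produce $l$ from 1 to $n$. Thus,
we have


$$\mathfrak{m}(h_n^\theta) \geq n \tag{47}$$
$$\therefore \mathfrak{m}(h_n) = \Omega(n). \tag{48}$$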
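The lower-bound construction can be enumerated for small $n$. The sketch below assumes the setting of the proof (one-dimensional values equal to the input bits and fully tied keys), so the limiting summary is the plain average of the bits; `summary_configurations` is a hypothetical name.

```python
from fractions import Fraction
from itertools import product


def summary_configurations(n):
    """Distinct limiting summary values over all binary strings of length n."""
    # With tied keys, saturated attention averages the values uniformly.
    return {Fraction(sum(bits), n) for bits in product((0, 1), repeat=n)}


for n in range(1, 7):
    print(n, len(summary_configurations(n)))
```

Counting the all-zero string as well gives $n + 1$ distinct values in this sketch, which is consistent with the $\Omega(n)$ lower bound and well under the $O(n^2)$ upper bound of Lemma D.2.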


Lemma D.4 (Attention state complexity with
unique maximum). If, for all $X$, there exists a
unique $t^*$ such that $t^* = \operatorname{argmax}_t q_n k_t$, then


$$\mathfrak{m}(h_n) = O(1).$$


Proof. If $q_n k_t$ has a unique maximum, then by
Corollary 4.1.1 attention returns


$$\lim_{N \to \infty} \operatorname*{argmax}_{v_t} q k_t = \lim_{N \to \infty} v_{t^*}. \tag{49}$$


By construction, there is a finite set $S$ which is a
superset of the configuration set of $v_{t^*}$. Thus,


$$\mathfrak{m}(h_n) \leq |S| = O(1). \tag{50}$$


Lemma D.5 (Attention state complexity with
ReLU activations). If $\lim_{N \to \infty} v_t \in \{0, \infty\}^k$ for
$1 \leq t \leq n$, then


$$\mathfrak{m}(h_n) = O(1).$$


Proof. By Theorem 4.1, we know that attention
computes


$$\lim_{N \to \infty} h_n = \lim_{N \to \infty} \frac{1}{m} \sum_{i=1}^{m} v_{t_i}. \tag{51}$$

This sum evaluates to a vector in $\{0, \infty\}^k$, which
means that


$$\mathfrak{m}(h_n) \leq 2^k = O(1). \tag{52}$$


Lemma D.5 applies if the sequence $v_1, .., v_n$ is
computed as the output of ReLU. A similar result
holds if it is computed as the output of an
unsquashed linear transformation.


E CNN Lemmas 


Lemma E.1 (CNN counterexample).


$$a^* b a^* \notin L(\mathrm{CNN}).$$


Proof. By contradiction. Assume we can write
a network with window size $k$ that accepts any
string with exactly one $b$ and rejects any other
string. Consider a string with two $b$s at indices $i$
and $j$ where $|i - j| > 2k + 1$. Then, no column
in the network receives both $x_i$ and $x_j$ as input.
When we replace one $b$ with an $a$, the value of
$h_+$ remains the same. Since the value of $h_+$ (26)
fully determines acceptance, the network does not
accept this new string. However, the string now
contains exactly one $b$, so we reach a contradiction.


Definition E.1 (Strictly $k$-local grammar). A
strictly $k$-local grammar over an alphabet $\Sigma$ is a
set of allowable $k$-grams $S$. Each $s \in S$ takes the
form


$$s \in (\Sigma \cup \{\#\})^k$$


where $\#$ is a padding symbol for the start and end
of sentences.


Definition E.2 (Strictly local acceptance). A
strictly $k$-local grammar $S$ accepts a string $\sigma$ if,
at each index $i$,


$$\sigma_i \sigma_{i+1} .. \sigma_{i+k-1} \in S.$$
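Definitions E.1 and E.2 translate into a short membership check. The following sketch makes one assumption the definitions leave open, namely that the string is padded with $k - 1$ copies of $\#$ on each side; the function name `sl_accepts` and the example grammar are hypothetical.

```python
def sl_accepts(grammar, k, string, pad="#"):
    """Check that every k-gram of the padded string is an allowed k-gram."""
    # Assumed convention: k - 1 padding symbols at the start and at the end.
    padded = pad * (k - 1) + string + pad * (k - 1)
    return all(padded[i:i + k] in grammar
               for i in range(len(padded) - k + 1))


# Strictly 2-local grammar over {a, b} that forbids two adjacent b symbols.
no_bb = {"##", "#a", "#b", "aa", "ab", "ba", "a#", "b#"}
for s in ["", "abab", "abba", "b"]:
    print(repr(s), sl_accepts(no_bb, 2, s))
```

Acceptance depends only on which $k$-grams occur, never on how many times or how far apart, which is why a language such as $a^* b a^*$ falls outside this class.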


Lemma E.2 (Implies Theorem 5.2). A $k$-CNN can
asymptotically accept any strictly $(2k+1)$-local language.


Proof. We construct a $k$-CNN to simulate a
strictly $(2k+1)$-local grammar. In the convolutional
layer (25), each filter identifies whether a particular
invalid $(2k+1)$-gram is matched. This condition
is a conjunction of one-hot terms, so we can apply tanh
to a linear transformation so that the filter asymptotically
comes out to $1$ if a particular invalid sequence is matched,
and $-1$ otherwise.

Next, the pooling layer (26) collapses the filter
values at each time step. A pooled filter will be
$1$ if the invalid sequence it detects was matched
somewhere and $-1$ otherwise.

Finally, we decide acceptance (27) by verifying
that no invalid pattern was detected. To do this,
we assign each filter a weight of $-1$ and use a threshold
of $-K + \frac{1}{2}$, where $K$ is the number of invalid
patterns. If any filter has value $1$, then this sum
will be negative. Otherwise, it will be $\frac{1}{2}$. Thus,
asymptotic sigmoid will give us a correct acceptance
decision.
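The three layers of this construction can be simulated with saturated activations. This is a sketch under stated assumptions: each position sees $k$ neighbors on both sides, the input is padded with $k$ copies of $\#$, the invalid grams are given as a list of distinct strings, and hard $\pm 1$ values replace the asymptotic tanh and sigmoid. The name `cnn_accepts` is hypothetical.

```python
def cnn_accepts(invalid_grams, k, string, pad="#"):
    """Simulate the saturated k-CNN with one filter per invalid (2k+1)-gram."""
    width = 2 * k + 1
    padded = pad * k + string + pad * k
    windows = [padded[i:i + width] for i in range(len(string))]
    # Convolution with saturated tanh: +1 where a filter's gram matches,
    # -1 elsewhere. Max pooling over time follows; with no windows at all
    # (empty string) every pooled filter stays at -1.
    pooled = [max((1 if w == gram else -1 for w in windows), default=-1)
              for gram in invalid_grams]
    # Output unit: weight -1 on each filter plus the threshold -K + 1/2.
    score = -sum(pooled) + (-len(invalid_grams) + 0.5)
    return score > 0


# With k = 1, forbid every trigram window that contains two adjacent b's.
invalid = ["bba", "bbb", "bb#", "abb", "#bb"]
for s in ["abab", "abba", "bb", ""]:
    print(repr(s), cnn_accepts(invalid, 1, s))
```

When no filter fires, the pooled values sum to $-K$ and the score is exactly $\frac{1}{2}$; a single firing filter lowers the score by 2, which makes it negative.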


F Neural Stack Lemmas 


Refer to Hao et al. (2018) for a definition of the 
StackNN architecture. The architecture utilizes a 
differentiable data structure called a neural stack. 
We show that this data structure has $2^{\Theta(n)}$ state
complexity.


Lemma F.1 (Neural stack state complexity). Let
$S_n \in \mathbb{R}^{n \times k}$ be a neural stack with a feedforward
controller. Then,


$$\mathfrak{m}(S_n) = 2^{\Theta(n)}.$$


Proof. By the general state complexity bound
(Theorem A.2), we know that $\mathfrak{m}(S_n) = 2^{O(n)}$. We
now show the lower bound.

The stack at time step $n$ is a matrix $S_n \in \mathbb{R}^{n \times k}$
where the rows correspond to vectors that have
been pushed during the previous time steps. We
set the weights of the controller $\theta$ such that, at
each step, we pop with strength 0 and push $x_t$ with
strength 1. Then, we have


$$\mathfrak{m}(S_n^\theta) = |\Sigma|^n \tag{53}$$
$$\therefore \mathfrak{m}(S_n) = 2^{\Omega(n)}. \tag{54}$$


